Abstract

Deep learning solutions based on deep neural networks (DNN)
and deep stacking networks (DSN) were investigated for
classifying target images in a non-time-locked rapid serial
visual presentation (RSVP) image target identification task
using EEG. Several feature extraction methods associated with
this task were implemented and tested for deep learning, and
a sliding window method using the trained classifier was used
to predict the occurrence of target events in a non-time-locked
fashion. The deep learning algorithms explored, based on deep
stacking networks, were able to improve the error rate by about
5% over existing algorithms such as linear discriminant
analysis (LDA) for this task. Initial test results also showed that
this method based on deep stacking networks for
non-time-locked classification can produce an error rate close
to that achieved for time-locked classification, thus illustrating
the power of deep learning for complex feature spaces.

Index terms - RSVP, non-time-locked events, feature selection,
deep learning, deep neural networks, deep stacking networks,
brain-computer interaction.

I.  Introduction 

A  brain  computer  interaction  (BCI)  system  allows  human 
subjects  to  communicate  with  or  control  an  external  device 
with  their  brain  signals  [1],  or  to  use  those  brain  signals  to 
interact  with  computers,  environments,  or  even  other  humans 
[2].  One  application  of  BCI  is  to  use  brain  signals  to  distinguish 
target  images  within  a  large  collection  of  non-target  images  [2]. 
Such  BCI-based  systems  can  drastically  increase  the  speed  of 
target  identification  in  large  image  databases  over  manual 
procedures  [3].  Data  collection  for  training  such  BCI  systems  is 
commonly  carried  out  using  the  rapid  serial  visual  presentation 
(RSVP)  paradigm  [2],  where  test  subjects  are  asked  to  identify 
a  target  image  from  a  continuous  burst  of  image  clips  presented 
at a high rate. The EEG recordings are collected and a classifier
capable of predicting the presence of target images based on
EEG responses is trained using these data.

Classification for RSVP data is usually performed in a
‘time-locked’ fashion, by analyzing the spectrum or amplitude
in the EEG signal 300-1000ms immediately following
presentation of target and non-target images. Although
processing RSVP data without time locking is more realistic,
non-time-locked classification is significantly more demanding
because target event timing needs to be explicitly or implicitly
estimated. So far, a number of classifiers including logistic
regression, linear discriminant analysis (LDA) and support
vector machines (SVM) have been proposed in the literature to
address the classification of time-locked events [2, 4]. These
classifiers have been reported to provide less than 10% error
rate. The problem of imperfectly time-locked events was
considered in [5], where event timing was assumed to be
unknown but occurring within a small known interval.
Performance close to that achieved under perfect time locking
was reported. However, classification for completely
non-time-locked events has not yet been addressed.

In this paper, we investigate deep learning (DL) solutions to
non-time-locked RSVP classification. Deep learning is a term
for a new family of learning methods that have been shown to
offer superb representation of complex data by using a
multiple-layered architecture [6, 7]. DL has gained great
interest in recent years due to its ability to outperform
alternative classification methods in several machine learning
competitions and in a variety of applications, including image
classification and speech recognition [8, 9]. However, DL
applications for EEG data analysis are at a very early stage. It is
not  clear  if  and  how  unique  characteristics  of  EEG  data 
including  high  dimensional  feature  spaces,  temporal  and  spatial 
data  correlation,  and  excessive  noise  will  affect  the 
implementation  and  performance  of  DL  algorithms.  The  goal  of 
this  work  is  two-fold.  First,  we  aim  to  develop  solutions  to 
classify  non-time-locked  events  in  RSVP.  Second,  we  intend  to 
investigate  the  use  of  DL  algorithms  for  EEG  data  analysis  in 
BCI  research. 

II.  Material  and  methods 

A.  Experimental  Design 

The RSVP EEG recordings were obtained from [2] and
include the brain activity of five participants presented with a
series of bursts of images in an RSVP paradigm. Each burst
lasts for 4.1s and consists of 49 images presented at a rate of
12 images/second. A burst may contain zero or one target
image, where a target image includes a silhouette of an airplane
that is not present in non-target images. To ensure no
interference from burst edges, the target image is not presented
within 500 milliseconds (ms) of the onset or offset of the
burst. EEG recordings were collected using a BIOSEMI
ActiveTwo system with 256 electrodes at a 256 Hz sampling rate
with 24-bit digitization.

B. Data Preprocessing and Prediction Objective

The raw data include 7.1s EEG data epochs, each centered on a
4.1s RSVP burst. The data were first bandpass-filtered in the
range 2-100Hz. Independent component analysis (ICA) using
the Extended Infomax algorithm in EEGLAB [10] was then
performed to reduce the correlation between channels and to
remove noise. The 16 components with the highest variance were
retained. To capture the time-frequency characteristics, a
wavelet transformation was applied to the ICA-transformed
data to obtain a temporal-IC power spectrum for 18 frequency
bands evenly sampled on a logarithmic scale from 2-100Hz.
Only 5s of each 7.1s data epoch were extracted from the
recordings of subjects 1-5, where for target epochs, the target
event onset time was at 2s of the 5s epoch. There were a total
of 138, 129, 114, 121, and 145 target epochs for subjects 1 to 5,
and 188, 171, 194, 188, and 190 non-target epochs for subjects 1
to 5, respectively. Our goal is to predict if and when a target
image is present in a 5-second epoch. The transformed data of
each epoch represent the power of the EEG recording distributed
in three dimensions: independent components (ICs), frequency,
and time.
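
The time-frequency step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the wavelet family is not stated in the text, so a complex Morlet wavelet with a fixed number of cycles is assumed, and the function names `log_spaced_freqs` and `morlet_power` are hypothetical.

```python
import numpy as np

FS = 256        # sampling rate in Hz (from the recording setup)
N_FREQS = 18    # number of frequency bands (from the text)

def log_spaced_freqs(f_min=2.0, f_max=100.0, n=N_FREQS):
    """Centre frequencies evenly spaced on a logarithmic scale."""
    return np.geomspace(f_min, f_max, n)

def morlet_power(signal, freqs, fs=FS, n_cycles=6.0):
    """Time-frequency power of one IC time course.

    Assumption: complex Morlet wavelets with a fixed number of cycles.
    The signal must be longer than the longest wavelet (about 3.8 s at 2 Hz).
    Returns an array of shape (len(freqs), len(signal)).
    """
    power = np.empty((len(freqs), len(signal)))
    for i, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)               # temporal width in seconds
        t = np.arange(-4 * sigma_t, 4 * sigma_t, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))     # normalise to unit energy
        power[i] = np.abs(np.convolve(signal, wavelet, mode="same")) ** 2
    return power

# Placeholder input: a 10 Hz sine lasting 5 s stands in for one IC time course.
freqs = log_spaced_freqs()
t = np.arange(5 * FS) / FS
power = morlet_power(np.sin(2 * np.pi * 10.0 * t), freqs)    # shape (18, 1280)
```

Applying `morlet_power` to each of the 16 retained ICs and stacking the results would give the IC x frequency x time array that the rest of the pipeline operates on.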

C.  Construction  of  training  data. 

We developed a solution based on sliding windows, where a
500ms-long window slides from the beginning to the end of an
epoch at a step size of 1 sample. For each window location, the
EEG data within the window are passed to a classifier to
predict if a target image is present. The key to this solution is to
train a classifier that can predict a target event if the
event-related brain response occurs within the 500ms window
of the input EEG data. What makes this training difficult is that
the response can happen at any place within the window. To
construct a training set, we define the target event region as the
region from the event onset time (2s) to one second (1s)
afterwards. Since a target image can appear 200-300ms after
the onset time and a target-related potential is known to occur
300-500ms after an image appears in each of the target epochs,
a 500ms window within the target event region should contain
the event-related brain response. To account for a potential offset
between the sliding window and the target event, we also
investigated the target event region with 200ms offset, which
covers 200ms before target onset to 300ms after target onset.
For each of the defined target event regions, 50 sections of
500ms-long EEG data were taken at random from
the one-second target event region and labeled as target events.
Next, 50 non-target labeled data samples of 500ms windows
were extracted randomly from the 5-second non-target epochs.
This was done for each target/non-target epoch and in the end,
two training datasets corresponding to offsets of 0s and 200ms,
respectively, were obtained for each subject. Each set includes
6750, 6450, 5700, 6050, 7250 target event samples and 9250,
8550, 9700, 9400, 9500 non-target event samples for subjects 1
to 5, respectively, where a sample contains IC-time-frequency
powers for 500ms, resulting in a 36864 (16 ICs x 18 frequencies
x 128 time samples) dimensional feature vector. The
classification output includes two labels: target or non-target
event.
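
The window-sampling procedure above can be sketched as follows. This is a rough illustration under stated assumptions: windows are drawn uniformly at random and must lie entirely inside the region, the helper name `sample_windows` is hypothetical, and the epochs are random placeholder data rather than real recordings.

```python
import numpy as np

FS = 256             # sampling rate in Hz
WIN = FS // 2        # 500 ms window = 128 samples
N_PER_EPOCH = 50     # windows drawn per epoch (from the text)

def sample_windows(epoch, region_start_s, region_end_s, n=N_PER_EPOCH, rng=None):
    """Draw n random 500 ms windows lying fully inside the given region.

    epoch: array of shape (n_ics, n_freqs, n_times) of IC-frequency-time power.
    Returns an array of shape (n, n_ics * n_freqs * WIN) of flattened features.
    """
    rng = np.random.default_rng() if rng is None else rng
    first = int(round(region_start_s * FS))
    last = int(round(region_end_s * FS)) - WIN       # latest admissible window start
    starts = rng.integers(first, last + 1, size=n)   # upper bound is exclusive
    return np.stack([epoch[:, :, s:s + WIN].reshape(-1) for s in starts])

# Placeholder epochs: 16 ICs x 18 frequencies x 5 s of samples.
rng = np.random.default_rng(0)
target_epoch = rng.random((16, 18, 5 * FS))
nontarget_epoch = rng.random((16, 18, 5 * FS))

target_windows = sample_windows(target_epoch, 2.0, 3.0, rng=rng)        # 0 s offset region
nontarget_windows = sample_windows(nontarget_epoch, 0.0, 5.0, rng=rng)  # whole epoch
```

Each row has 16 x 18 x 128 = 36864 entries, matching the feature dimension quoted in the text.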


D.  Deep  Learning  Algorithms 

Deep learning is a term that refers to a class of new machine
learning algorithms that exploit architectures of layered
modules of supervised or unsupervised learning algorithms.
Depending on the learning nature of each module, the existing
deep learning algorithms can be classified into generative,
discriminative, and hybrid architectures [7]. Generative
architectures include the deep auto-encoder, which consists of
layers of neural networks whose output has the same
dimension as the input. The main objective of deep
auto-encoders is to extract features from data as opposed to
classification. Deep stacking networks (DSN) exemplify
discriminative architectures [11-13]. For DSN, each module is
a classifier that takes the form of a simplified multilayer
perceptron, which includes a shallow sigmoidal neural network
followed by a linear classifier. The hybrid class includes the
well-known deep neural network (DNN), which consists of layers
of restricted Boltzmann machines with a classification module at
the very top. A survey of these algorithms can be found in [6].
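
To make the stacking idea concrete, the sketch below builds a toy DSN-style classifier. It is a simplification rather than the algorithm of [11-13]: the hidden weights are random and left untrained, the linear output weights are obtained by ridge regression, and each higher module receives the raw input concatenated with the output of the module directly below it. All names (`DSNModule`, `fit_dsn`, `predict_dsn`) and parameter values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DSNModule:
    """One module: sigmoidal hidden layer followed by a linear output layer."""

    def __init__(self, n_in, n_hidden, rng):
        # Assumption: hidden weights are random and not trained in this sketch.
        self.W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_hidden))
        self.b = rng.normal(size=n_hidden)
        self.U = None

    def hidden(self, X):
        return sigmoid(X @ self.W + self.b)

    def fit(self, X, T, ridge=1e-2):
        H = self.hidden(X)
        # Closed-form ridge regression for the linear output weights.
        A = H.T @ H + ridge * np.eye(H.shape[1])
        self.U = np.linalg.solve(A, H.T @ T)
        return self

    def predict(self, X):
        return self.hidden(X) @ self.U

def fit_dsn(X, T, n_modules=3, n_hidden=20, seed=0):
    """Train modules bottom-up; T holds one-hot class targets."""
    rng = np.random.default_rng(seed)
    modules, inp = [], X
    for _ in range(n_modules):
        module = DSNModule(inp.shape[1], n_hidden, rng).fit(inp, T)
        modules.append(module)
        # Stacking: the next module sees the raw input plus this module's output.
        inp = np.hstack([X, module.predict(inp)])
    return modules

def predict_dsn(modules, X):
    """Class scores from the top module; take argmax over columns for labels."""
    inp = X
    for module in modules:
        out = module.predict(inp)
        inp = np.hstack([X, out])
    return out
```

With 100 PCA features and two classes, the first module in this sketch would take 100 inputs and every later module 102.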

In  this  paper  we  will  focus  on  the  DNN  and  the  DSN.  Two 
variations  of  DSN,  one  with  a  linear  (DSN-L)  and  one  with  a 
sigmoidal  activation  function  (DSN-S)  were  implemented. 
When  applying  DL  to  EEG  data,  high  feature  dimension  and 
potentially  strong  temporal,  spatial,  and  frequency  feature 
correlations  can  be  problematic.  The  dimension  of  the  input 
features  in  the  described  training  set  is  36864  (16x18x128), 
which  is  unrealistically  high  for  most  classifiers,  including  DL 
algorithms.  Reduction  of  feature  dimension  needs  to  be 
performed  before  these  features  can  be  used  for  DL-based 
classification.  Moreover,  the  features  in  adjacent  channels, 
frequency  bands,  and  time  samples  are  highly  correlated,  which 
might  affect  the  convergence  of  DL  algorithms.  We  used  ICA 
to  reduce  the  spatial  (EEG  channels)  feature  dimension  and 
correlation. We reduced the dimension and correlation of
temporal and frequency features using down-sampling and
principal component analysis.

III.  Results 

The feature dimension reduction schemes investigated were
down-sampling and principal component analysis (PCA). Parameters
for preprocessing and DL algorithms were identified using grid
search on the data from subject 1. Cross-validated training and
prediction of non-time-locked target events were conducted for
each of the remaining 4 subjects using the dimension reduction
scheme and DL parameters that led to the best performance in
subject 1.

A.  Investigation  of  feature  dimension  reduction 
1)  Reduction  by  down-sampling 

The impact of down-sampling on classification was tested by
comparing results from 4 different orders of reduction (i.e.,
down-sampling by a factor of 2, 4, 8 or 16). LDA was used as the
baseline classifier for this investigation. Table I shows the
average error rates for 5-fold cross validation using different
down-sampling factors. The lowest error rate was obtained for
a down-sampling factor of 8. As a result, the time samples were
reduced by a factor of 8 from the original 128 to 16, reducing
the feature dimension to 4608 (16 ICs x 18 frequencies x 16
time samples).

Down-sampling factor    2         4         8         16
Error rate              0.3724    0.3657    0.2959    0.2981

Table I. Error rate of LDA for different down-sampling factors
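
The dimension arithmetic for the down-sampling step can be checked with a few lines. The sketch assumes plain decimation (keeping every k-th time sample, with no anti-aliasing filter), since the text does not say how the down-sampling was carried out; `downsample_time` is a hypothetical helper.

```python
import numpy as np

def downsample_time(power, factor):
    """Keep every `factor`-th time sample along the last axis.

    Assumption: plain decimation without an anti-aliasing filter.
    """
    return power[..., ::factor]

# One 500 ms window: 16 ICs x 18 frequencies x 128 time samples (placeholder zeros).
window = np.zeros((16, 18, 128))
for factor in (2, 4, 8, 16):
    reduced = downsample_time(window, factor)
    print(factor, reduced.shape, reduced.size)
```

For a factor of 8 the window shrinks to 16 x 18 x 16, that is, the 4608 features used in the remaining experiments.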


2)  Reduction  by  principal  component  analysis 

Both DSN and DNN were tested on the down-sampled data;
however, neither converged. Correlation of
features from adjacent frequency bands and time points
contributed to the poor convergence of the stochastic
optimization used in these DL algorithms. To overcome this
problem, principal component analysis was applied to reduce
the correlations among features over time and frequency,
further reducing the feature dimension. Table II shows the
classification performance of LDA both before PCA and after
PCA using only the first 100 principal components (PCs). It is
clear that applying PCA improves the classification performance
of LDA. We use the first 100 PCs based on the results of grid
search.


Dimension reduction    Before PCA    After PCA (100 PCs)
Classifier             LDA           LDA
Error rate             0.2959        0.2044

Table II. Error rate before and after PCA
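
The PCA step can be sketched with a singular value decomposition. This is an illustrative sketch: the helper names `fit_pca` and `apply_pca` are hypothetical, the data are random placeholders, and the principal axes are assumed to be estimated on training data only and then reused for test data.

```python
import numpy as np

def fit_pca(X, n_components=100):
    """Return the training mean and the top principal axes of X (samples x features)."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def apply_pca(X, mean, components):
    """Project samples onto the retained principal components."""
    return (X - mean) @ components.T

# Placeholder data: 300 samples of the 4608 down-sampled features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4608))
mean, components = fit_pca(X_train, n_components=100)
Z_train = apply_pca(X_train, mean, components)   # shape (300, 100)
```

The projected training features are mutually uncorrelated by construction, which addresses the correlation between adjacent time and frequency features noted in the text.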


B. Investigation of deep learning parameters
For both DNN and DSN, the number of layers and the number
of hidden units in each layer can impact the classification
performance. Tuning parameters, including the learning rate, are
also important factors. We determined the best number of
hidden units and layers as well as the values of other tuning
parameters by using the training data of subject 1 with 5-fold
cross validation.


DNN           First layer (100 units)    First (100 units) + second (50 units) layer
Error rate    0.1791                     0.1706

Table III. Error rate of DNN with different hidden layers

For DNN, two-layer architectures resulted in the best
performance (Table III); using more than 3 layers resulted in
poor convergence. For DSN, the best performance was
achieved at 21 layers with 60 hidden units for 0ms offset and
14 layers with 60 hidden units for 200ms offset. The best
performance for the different DL algorithms is summarized
in Table IV. For both DNN and DSN, only one iteration of fine
tuning was applied. The performance of DNN and DSN was
similar, with DSN-S achieving a slightly lower error rate. The
results for 0s offset are also consistently better than those for
200ms offset, suggesting that training data with 0s offset better
capture the brain response to the target image. It is likely that
some of the target event training samples for 200ms offset
contained no corresponding brain response and thus were
mislabeled. The resulting parameter values for the DL algorithms
were fixed for training and prediction in the remaining 4 subjects
in later sections of the paper.


Figure 1. ROC curves for subjects 2-5


Offset    DSN-L     DSN-S     DNN
0ms       0.1693    0.1666    0.1706
200ms     0.1883    0.1856    0.1907

Table IV. Average error rates of DL algorithms


C. Training of DL classifiers for individual subjects

We proceeded to train separate DL classifiers for subjects 2 to 5.
The parameter settings obtained in Section III.B that led to the
best performance were selected for the training. The goal of
training was to estimate the DL weights and compare the DL
performance with LDA. 5-fold cross validation was used for
evaluation.

Figure  1  shows  the  ROC  curves  and  Area-Under-the-Curve 
(AUC)  statistics  of  the  trained  DL  algorithms  for  all  4  subjects 
for  0s  offset.  Consistent  with  results  reported  previously  [5,  6], 
the  AUC  performance  for  subjects  2  and  4  was  lower  than  that 
for  subjects  3  and  5.  The  three  DL  algorithms  achieved  similar 
performance.  All  the  DL  algorithms  improved  performance  by 
about  5%  in  AUC  for  all  subjects  when  compared  with  LDA. 
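
For reference, the AUC statistic used in this comparison can be computed directly from classifier scores. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen target sample scores higher than a randomly chosen non-target sample, counting ties as one half); the function name `auc` and the toy inputs are illustrative only.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for target samples, 0 for non-target samples.
    scores: classifier outputs, higher meaning 'more likely target'.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy example with made-up labels and scores.
toy_auc = auc([1, 1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6, 0.1])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch but would usually be replaced by a sort-based computation on datasets of the size described here.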

D. Predictions of non-time-locked target events

We proceeded to predict the target event in a 5-second epoch
with sliding windows using the trained DSN-S classifiers for
each subject (2-5), respectively. A data window in the target
event region of a target 5s-epoch was labeled as a “target event”
while those in the remaining regions were labeled “non-target
events”. The prediction ROC curve is shown in Fig. 2-A and
the AUC statistics are very close to those of the training in Fig.
1, suggesting that our constructed training dataset was
sufficient to capture the characteristics of the EEG data for both
target and non-target events. Once again, the performance for
subjects 3 and 5 is better. This performance is also comparable
to that of the time-locked prediction [4]. Examples of the
prediction results in both target and non-target epochs are
shown in Fig. 3. We observed that our method did very well in
predicting the target event regions and could also correctly
predict the onset of the target event. Overall, the false negative
predictions were few and were made towards the end of the target
event region. It is likely that towards the end of the target event
region, the brain response to target images has already faded,
resulting in these false negative predictions.

Figure 2. ROC (A) and Precision-Recall (B) curves of
prediction by sliding window.
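
The sliding-window prediction described in this section can be sketched as follows. The sketch assumes a generic `classify` callable that maps a flattened 500ms window to a target probability; it stands in for the trained DSN-S classifier, and the toy epoch and toy classifier are placeholders for illustration, not real data.

```python
import numpy as np

FS = 256          # sampling rate in Hz
WIN = FS // 2     # 500 ms window = 128 samples

def sliding_window_scores(epoch, classify, step=1):
    """Score every 500 ms window of an epoch.

    epoch: array of shape (n_ics, n_freqs, n_times).
    classify: hypothetical callable mapping a flattened window to a probability.
    Returns (window start times in seconds, scores).
    """
    n_times = epoch.shape[-1]
    starts = np.arange(0, n_times - WIN + 1, step)
    scores = np.array([classify(epoch[..., s:s + WIN].reshape(-1)) for s in starts])
    return starts / FS, scores

def first_detection(times, scores, threshold=0.5):
    """Start time of the first window whose score exceeds the threshold, or None."""
    hits = np.flatnonzero(scores > threshold)
    return float(times[hits[0]]) if hits.size else None

# Toy epoch: power is 1 inside the 2-3 s target event region and 0 elsewhere.
toy_epoch = np.zeros((16, 18, 5 * FS))
toy_epoch[..., 2 * FS:3 * FS] = 1.0

def toy_classifier(window):
    # Stand-in for a trained classifier: fraction of the window inside the region.
    return window.mean()

times, scores = sliding_window_scores(toy_epoch, toy_classifier)
onset_estimate = first_detection(times, scores)
```

With a step of one sample, a 5-second epoch yields 1153 windows, so in practice the classifier would be applied to all windows in a single batched call rather than one at a time.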

To further investigate the effectiveness of our method in the
prediction of target events, we plotted in Fig. 2-B the
precision-recall (PR) curve for the predictions made in the
target event epochs. The precision is defined as the percentage
of correctly predicted target events among all data windows
that were predicted as a target event at a given decision
threshold. As can be seen, except for subject 4, DSN-S can
achieve and maintain 100% precision until the prediction recall
reaches almost 10%. This implies that the top 10% highest-ranked
predictions are true positive predictions of target events. Taken
together, these preliminary results indicate that the proposed
sliding window method for predicting non-time-locked target
events may be able to achieve a performance level close to that
of the time-locked prediction (Table V).
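
The precision defined above, together with the recall, can be computed at a single decision threshold as in the sketch below. The function name `precision_recall` and the toy labels and scores are illustrative; returning a precision of 1.0 when no window is predicted as a target is a convention assumed here.

```python
import numpy as np

def precision_recall(labels, scores, threshold):
    """Precision and recall of 'target' predictions at one decision threshold.

    labels: 1 for windows in the target event region, 0 otherwise.
    scores: classifier outputs for the same windows.
    """
    labels = np.asarray(labels, dtype=bool)
    predicted = np.asarray(scores) >= threshold
    tp = np.sum(predicted & labels)     # target windows predicted as target
    fp = np.sum(predicted & ~labels)    # non-target windows predicted as target
    fn = np.sum(~predicted & labels)    # target windows that were missed
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention: no predictions made
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)

# Toy example with made-up labels and scores.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.1]
strict = precision_recall(toy_labels, toy_scores, threshold=0.75)
lenient = precision_recall(toy_labels, toy_scores, threshold=0.5)
```

Sweeping the threshold from high to low and plotting the resulting pairs traces out a PR curve of the kind shown in Fig. 2-B.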


Subject    2       3       4       5
DSN        0.73    0.85    0.71    0.82
cLDA       0.81    0.91    0.68    0.88

Table V. Comparison of AUCs between the proposed
DSN-based sliding window method for
non-time-locked event prediction (Fig. 2) and cLDA
for time-locked event prediction reported in [5].

IV.  Conclusion  and  future  work 
We presented in this paper an investigation of deep learning
classifiers based on the architectures of the DSN and DNN for
automatic classification of non-time-locked image RSVP
events. The preliminary results obtained from analyzing five
subjects, one for training and four for validation, indicate that
deep learning may be able to improve the prediction error rate
by about 5% over other existing mainstream methods for this
task. In addition, we provided preliminary results showing that
a sliding window method based on the DSN produced an error
rate similar to that for prediction of time-locked events. Our
study in this paper suggests that deep learning has a strong
potential to be a powerful tool for BCI research. However,
careful extraction of features from the EEG signal to feed into
the existing DSN or DNN architectures is likely to further
improve the classification accuracy in this application domain.
How to incorporate the stage of modeling EEG features into the
deep learning architecture will be a topic of future study. The
main challenge is to take into account both temporal and spatial
correlations in the observed data exhibiting variable
dimensionality. This is an active topic in deep learning
research in other application domains [17, 18, 19], and progress
there is likely to help the BCI application discussed in this paper.

Figure 3. Examples of predictions by sliding window.
Panels: A1. Target epoch 1 (error rate 0.08); B1. Non-target
epoch 1 (error rate 0.23). The horizontal axis is time (seconds).
The left column includes epochs with target events. The
target event regions are highlighted in green. The vertical
axis denotes the probability of predicting a target event. The
line at 0.5 represents the decision threshold.